UNIKN-ArmsDealingVis-MC1

VAST 2010 Challenge
Text Records - Investigations into Arms Dealing

Authors and Affiliations:

      Mohammad Najm-Araghi, University of Konstanz, najm.araghi@googlemail.com

      Sergey Pulnikov, University of Konstanz, sergey.pulnikov@googlemail.com 
      Dr. Peter Bak, University of Konstanz, bak@dbvis.inf.uni-konstanz.de

Tool(s):

We have developed our own tool for preprocessing and some visualizations.

To visualize the data we also use the Many Eyes online tool. Many Eyes was developed by IBM's research group in 2004.

To draw some of the graphs we use a network analysis application called visone. The origins of the visone project lie in a single link between the Algorithms & Data Structures Group in the Department of Computer & Information Science and the Domestic Politics & Public Administration Group in the Department of Politics & Management, both at the Universität Konstanz.

To achieve the same results as we have, you do not need any programming knowledge. With our tool, everyone is able to create the same visualizations in less than an hour.

 

Video:

 

http://ava.dbvis.de/AVA_MC1.mp4

 

ANSWERS:


MC1.1: Summarize the activities that happened in each country with respect to illegal arms deals based on a synthesis of the information from the different report types and sources. State the situation in each country at the end of the period (i.e. the end of the information you have been given) with respect to illegal arms deals being pursued. Present a hypothesis about the next activities you expect to take place, with respect to the people, groups, and countries.

 

1   Introduction

We have developed a tool that mines data from the reports and visualizes the required information. From each report, we extract locations, persons, organizations and dates. In the next step, we eliminate duplicates and save the results in our internal data structure. Our tool provides several ways to create an interactive visualization of the extracted data. To show the summarized activities that happened in each country with respect to illegal arms dealing, we have created an interactive World Map Visualization. To make recent activities more important, we also implemented a weight function.

The implementation took about a week of work. Throughout this period, manual tasks were interleaved with the implementation of our tool. Finding previous work that fits our approach, editing the findings manually and searching for appropriate visualizations took a bit more than one week of work. In total, the whole project took about 2 months and resulted in an interesting solution, which is presented in the following sections.

 

2   Knowledge Discovery in Text and Visual Analysis

The reports about illegal arms dealing that we need to analyze are text files. Most reports contain information about locations, persons and sometimes organizations that are involved in illegal arms dealing. Sources for some of these reports are telephone calls, mails and blogs.

Our first goal was to extract relevant information from the reports for further analysis. To do this, we developed our own tool based on the Stanford Named Entity Recognizer (NER) [3]. NER labels sequences of words in a text that are names of things, such as persons, locations and organizations. We extended this library to label dates as well. All duplicates were eliminated. From each report we get, for example: Locations: Turkey; Persons: Hakan; Organizations: United Arab Emirates; Date: 16.12.2008. The manual process, including the web search for existing tools, took about 3 hours.
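The grouping and de-duplication step described above can be sketched as follows. This is only an illustration under stated assumptions, not the code of our tool: it assumes the recognizer output is already available as a list of (token, label) pairs with the labels PERSON, LOCATION, ORGANIZATION and O, and the function names and the date pattern are hypothetical.

```python
import re
from collections import defaultdict

# Assumed date format dd.mm.yyyy, as in the example above (hypothetical pattern).
DATE_PATTERN = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")


def group_entities(tagged_tokens):
    """Merge consecutive tokens with the same label into one entity and
    drop duplicates within a report, keeping first-seen order.

    Note: two different entities of the same type that directly follow
    each other would be merged by this simple rule.
    """
    entities = defaultdict(list)
    current_words, current_label = [], "O"
    # The sentinel pair at the end flushes the last open entity.
    for word, label in list(tagged_tokens) + [("", "O")]:
        if label != current_label:
            if current_label != "O":
                name = " ".join(current_words)
                if name not in entities[current_label]:
                    entities[current_label].append(name)
            current_words, current_label = [], label
        if label != "O":
            current_words.append(word)
    return dict(entities)


def extract_dates(text):
    """Return the distinct date strings found in the text, sorted."""
    return sorted({match.group(0) for match in DATE_PATTERN.finditer(text)})
```

Calling `group_entities` on the tagged tokens of one report and `extract_dates` on its raw text yields one de-duplicated record per report, which is the granularity used in the rest of the analysis.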

 

 

Fig. 1. Bar-plot visualization shows the countries, ordered by frequency of occurrences. 

To summarize the activities that happened in each country (Figure 1), we created a list of all countries that occur in the reports, together with the number of reports in which each country appears (duplicates within a report are not counted). However, our main interest was to show not only the frequency but also the temporal variation in occurrence. To achieve this, we calculated a weight function that makes recent occurrences more important than older ones. Results are shown in Figure 2. The manual analysis took 2 hours. Extending the tool was the larger task in this case and took about 4 hours.
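One possible form of such a recency weighting is sketched below. The exact weight function is not specified in this report, so the exponential decay, the half-life parameter and the name `interestingness` are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from datetime import date


def interestingness(occurrences, reference_date, half_life_days=90.0):
    """Sum a recency weight per country.

    occurrences: iterable of (country, report_date) pairs, one pair per
    report in which the country appears.
    Each occurrence contributes 0.5 ** (age_in_days / half_life_days):
    a report dated on the reference date counts 1.0, a report that is one
    half-life old counts 0.5. The decay shape is an assumption.
    """
    scores = defaultdict(float)
    for country, report_date in occurrences:
        age_days = (reference_date - report_date).days
        scores[country] += 0.5 ** (age_days / half_life_days)
    return dict(scores)
```

With a weighting of this kind, a country mentioned a few times recently can outrank a country mentioned often but long ago, which is the effect the map saturation is meant to convey.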

Our tool can visualize this data in several ways.

 

·         Interactive bar-chart implemented within the tool

·         Export data for IBM many eyes visualization [4]

 

 

 

 

 

Fig. 2. World map shows the frequency of occurrences weighted by their temporal parameter (recent events are more important) in a combined “interestingness factor”. The saturation of countries increases with the importance of the country.

 

 

An overview of persons and organizations can also be visualized with bar charts in our tool. To enhance the quality of the extracted information, we extended our tool to allow manual changes to the preprocessing results [1]. To improve the results, we invested about 20 minutes in manual corrections of countries; for example, Moscow was changed to Russia. To visualize all persons and the connections between them, we use Many Eyes and export the corresponding data with our tool.
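Such a manual correction can be thought of as a small lookup table applied to the extracted locations. The sketch below is an assumption about how this could look, not our tool's implementation; only the Moscow entry comes from the text, and the names `LOCATION_CORRECTIONS` and `normalize_locations` are hypothetical.

```python
# Hypothetical correction table; only the Moscow entry is taken from the text.
LOCATION_CORRECTIONS = {"Moscow": "Russia"}


def normalize_locations(locations, corrections=LOCATION_CORRECTIONS):
    """Replace known cities by their country and remove the duplicates
    that result, keeping first-seen order."""
    normalized = []
    for location in locations:
        country = corrections.get(location, location)
        if country not in normalized:
            normalized.append(country)
    return normalized
```

Applying the corrections before counting avoids splitting one country's occurrences across several city names.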

 

 

 

 

 

Fig. 3.  Social network graph with all actors extracted by our tool. Each sub-graph corresponds to one social network.

 

 

As Figure 3 shows, not all persons are connected with each other; there are several isolated groups of persons. We next aimed at a combined view of persons and countries. To avoid overloading such a combined graph, we implemented a filtered visualization in which we can manually set an importance threshold for filtering countries and persons. This filtered visualization provides us with several hypotheses about future activities. The underlying assumption was that strong and multiple connections exist between the most important actors and the locations of their activities.

 

 

 

 

 

                                                                                                      

 

Fig. 4.  A subset of a social network combined with locations of activities.  The most probable arms deal will take place between Nicolai and Saleh Ahmed, in Russia and Yemen.

 

The filtered view gives more clarity for some assumptions. Four countries remain that participate in the arms dealing network; future events could therefore be:

 

·          Russia will deal with Ukraine

·          Russia, Ukraine and Yemen will participate in the next deal

 

If we consider the actors in the graph, the connections between them and the countries, there are some other possible hypotheses based on the cardinality of the nodes.

 

·          Maulana Haq Bukhari will start a deal in Pakistan

·          Muhammad Kasem and Akram Basri will form a vicious triangle

·          Nicolai and Saleh Ahmed will deal within three countries: Russia, Ukraine, Yemen

 

These hypotheses are based solely on the number associated with each node. Each connection means that the two nodes appeared in the same report.
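The way node counts and connections arise from co-occurrence in a report, and how a threshold filters the graph, can be sketched as follows. This is a minimal illustration assuming each report has already been reduced to a set of entity names; the function names and the single shared threshold are hypothetical simplifications.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(reports):
    """reports: list of sets of entity names, one set per report.

    Returns (cardinality, edges): the number of reports mentioning each
    entity, and the number of reports mentioning each unordered pair.
    """
    cardinality = Counter()
    edges = Counter()
    for entities in reports:
        cardinality.update(entities)
        # Sorting gives every unordered pair one canonical (a, b) key.
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return cardinality, edges


def filter_network(cardinality, edges, min_count):
    """Keep entities seen in at least min_count reports and the edges
    whose two endpoints are both kept."""
    kept = {name for name, count in cardinality.items() if count >= min_count}
    kept_edges = {pair: weight for pair, weight in edges.items()
                  if pair[0] in kept and pair[1] in kept}
    return kept, kept_edges
```

Raising `min_count` removes rarely mentioned nodes first, which corresponds to setting a higher importance boundary in the filtered view.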

To identify the most probable hypothesis, we created a time series visualization in the tool that shows the activity of a country or person over the whole time period of the reports. The implementation of the time-series visualization took approximately 4 hours. The results are shown in Figure 5.
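The data behind such a time series can be sketched as a count of mentions per entity and time bin. The monthly bin size and the name `monthly_activity` are assumptions for illustration; the report does not state which granularity the tool uses.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_activity(mentions):
    """mentions: iterable of (entity, report_date) pairs.

    Returns {entity: [((year, month), count), ...]} with the months in
    chronological order. Months without mentions are simply absent.
    """
    counts = defaultdict(Counter)
    for entity, report_date in mentions:
        counts[entity][(report_date.year, report_date.month)] += 1
    return {entity: sorted(per_month.items())
            for entity, per_month in counts.items()}
```

Comparing the resulting series of two entities side by side is what reveals similar activity patterns over the reporting period.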

 

Our results show that most likely Nicolai will deal with Saleh Ahmed, and these deals will be connected to Russia and Yemen, as highlighted in Figure 4.

 

 

 

Fig. 5. Similar time-series of Yemen and Russia based on the count (x-axis) and the appearance in the intelligence reports over the whole time period (y-axis).


Mini Challenge 1.2 in the next section presents further analysis of the social network. To support the hypotheses of this section, we use different network layouts there and weight the nodes with various measures.

References

1.  U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and T. Widener. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39:27–34, 1996.

2.  I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, 1999.

3.  J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling, 2005. [Online; accessed 15-May-2010].

4.  F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. ManyEyes: a site for visualization at internet scale. IEEE Trans Vis Comput Graph, 13(6):1121–1128, 2007.

 


MC1.2:  Illustrate the associations among the players in the arms dealing through a social network. If there are linkages among countries, please highlight these as well in the social network. Our analysts are interested in seeing different views of the social network that might help them in counterintelligence activities (people, places, activities, communication patterns that are key to the network).

 

1   Introduction

The second task was to visualize the relationships among the players and countries involved in arms dealing as a social network. This network is intended to help counterintelligence prevent further activities in different countries. The primary objective was to detect the central actors in the network with different analysis methods.

The major problem in this task was the extraction of the entities needed for such a visualization and analysis task. Our dataset consisted of 91 text records. These records are part of newspaper accounts, emails, message board entries, web sites, blog postings, telephone calls, bank transactions and observations. Our task included the extraction of persons, countries, organizations and even the date for possible further activities from this dataset.

2   Knowledge Discovery in Text and Network Analysis

This introduction to the task and its possible problems shows that the selection, pre-processing and transformation steps of the KDD process [2] are very important for a valid analysis and the creation of useful social networks. It was obvious that we needed Named Entity Recognition (NER) for the information extraction part. To avoid starting from scratch, our first step was a web search in which we examined existing applications in this area of research. This search took about 2-3 hours. Having tested some tools, we chose the NER system from the Stanford Natural Language Processing Group [4]. After a configuration step for determining the right classifier (classifiers/nereng-ie.crf-3-all2008.ser), we were satisfied with a first console output containing locations, names and organizations.

Based on this output, we were able to start our own implementation of a tool that included almost all KDD steps, except for the evaluation part.

 

 

 

Fig. 1. Extracted actors after the preprocessing step.

 

To obtain useful input data, we separated the records by blank lines and treated each record as an independent corpus. For this purpose, we wrote a small script which separates the records and saves them in different files. The prepared files serve as the input for our application. The interface of our application is divided into three main windows. The first shows the information extraction part. The second supports the manual analysis (Figure 1) and enables us to get an overview with a bar-chart visualization. The third one is suited for the network analysis task. The "read reports" button on the first tab invokes the pre-processing step as described, and visualizes the output for the different text records. The findings for one item could look like this:

Locations: Nairobi, Kenya, Dubai, Moscow; Persons: Nahid; Organizations: Tanya; Date: 1.04.2009
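The record-splitting script mentioned above could look roughly like the following sketch. It assumes that the records arrive as one text in which blank lines separate the records; the output file naming (report_001.txt, ...) and the function names are hypothetical.

```python
import re
from pathlib import Path


def split_records(text):
    """Split the raw text into records separated by one or more blank lines."""
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]


def write_records(text, out_dir):
    """Write each record to its own file and return the created paths.

    File names report_001.txt, report_002.txt, ... are an assumed convention.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, record in enumerate(split_records(text), start=1):
        path = out_dir / f"report_{index:03d}.txt"
        path.write_text(record + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Each resulting file can then be passed to the entity extraction step independently, so that duplicates are removed per report rather than across the whole dataset.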

The date category is an extension of the Stanford system. We have also deleted duplicate person, organization and location entries within one item. In addition, we manually mapped cities to their corresponding countries. This manual editing of the data set is supported by a feature of our application.

 

Another important decision based on a manual approach concerned the weighting of the U.S.A. as a node. The cardinality of the U.S.A. in the whole dataset is high. However, a manual examination of the text records showed that this high cardinality is related to the authors of the records. Since we were interested in the actors of the arms dealing network, it was reasonable to delete the U.S.A. as an actor in the network.

In this part, we combined manual and automatic preprocessing. My partner spent 2 hours on the manual preprocessing. Implementing the automatic support took us additional hours.

 

 

 

Fig. 2.  Using our tool for analysis of persons, countries and organizations.

 

The resulting output is a solid foundation for further analysis. At this point, we had already summarized each event with its important actors, the corresponding date and the place where the event occurred. The next step was an import into a relational database system, for which we imported the four required attributes into the database. Having started with textual information, we could now work with a database management system with all its advantages.

With this result, determining the count of each actor becomes a simple aggregation over all items in one attribute. At the bottom of the first window, this count is shown for all attributes used.
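The aggregation described above can be illustrated with a small in-memory database. The report does not name the database system or the schema, so SQLite, the table name `mention` and its one-row-per-mention layout are assumptions made only for this sketch.

```python
import sqlite3


def count_entities(rows):
    """rows: iterable of (report_id, kind, name, report_date) tuples.

    Loads the rows into an in-memory SQLite table (assumed schema) and
    returns (kind, name, number_of_reports) sorted by descending count.
    """
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE mention "
        "(report_id INTEGER, kind TEXT, name TEXT, report_date TEXT)"
    )
    connection.executemany("INSERT INTO mention VALUES (?, ?, ?, ?)", rows)
    # COUNT(DISTINCT report_id) ignores repeated mentions within one report.
    result = connection.execute(
        "SELECT kind, name, COUNT(DISTINCT report_id) AS n "
        "FROM mention GROUP BY kind, name ORDER BY n DESC, name"
    ).fetchall()
    connection.close()
    return result
```

The same grouped query, restricted to one kind of entity, provides the values for a bar chart of persons, organizations or countries.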

Having finished this step, we began to work on the analytical part. We started with a bar-chart visualization (using an existing Java library) of the corresponding attributes. The y-axis represents the cardinality and the x-axis the different persons, organizations and countries. This visualization in our second interface window is very useful for getting a brief overview and for determining a benchmark for the number of nodes in our social network visualization. The third component of our application supports the visualization of social networks. This part is based on IBM's Many Eyes [5].

Our tool enables us to visualize a country network, a social network and a filtered view of these diagrams (Figure 2). The filtering can be accomplished with the bar-chart in the second window. This is of course very useful for a more precise perspective of the main actors and countries.

Our first analysis based on the resulting network showed us that the main actors are separated into 4 distinct groups. The first subgraph contains Thaniti Otieno, Nahid Owiti and Vanjdhi Onyango. Nicolai Kuryakin and Saleh Ahmed represent the second subgraph. Maulana Haq Bukhari and Akram Basra build the third, Muhammad Kasem and Abdullah Khouri the fourth subgraph. On their own, these distinct subgraphs provide little information. Therefore, we combined the most important countries with the most frequent actors and their relationships. The result is an interesting graph, as depicted in Figure 3.

This representation leads us to the assumption that Nicolai works with Ukraine and Saleh Ahmed with Yemen.
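Separating a network into distinct groups, as in the first analysis above, corresponds to finding the connected components of the graph. The following sketch shows one generic way to do this from an edge list; it is not the method used by Many Eyes or by our tool, and the function name is hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """edges: iterable of (a, b) pairs of an undirected graph.

    Returns a list of sorted node lists, one per connected component,
    found with an iterative depth-first search.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbours[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(sorted(component))
    return components
```

Adding country nodes to the edge list before running such a search shows whether groups that are isolated at the person level become linked through a shared location.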

 

 

 

 

Fig. 3. Social network of all actors above a certain threshold.

 

 

 

Furthermore, both work with Russia and act as central persons in the whole network. The main countries in the dealing network are Yemen, Russia and Ukraine. The filtering of the other actors and countries is based on the inspection of each actor with the tool on tab three of our application. This inspection shows that the other actors are not as central as the two persons mentioned above. To support this hypothesis, we manually imported the graph into a network analysis application called visone [1]. We analyzed the graph with the betweenness measure and a centralized layout based on this value (Figure 4); the resulting network supports our hypothesis. If we map the view to the cardinality instead, two other actors appear: Muhammad Kasem and Akram Basri, who are already known to us from the results of the first step of the analysis.

 

 

 

Fig. 4. Visualization of the betweenness value. This is useful to determine central actors.
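For readers unfamiliar with the measure, betweenness counts how many shortest paths between other nodes run through a given node. The sketch below computes the unnormalised value for an undirected, unweighted graph with Brandes' algorithm. It is a plain illustration and not the visone implementation, which may normalise the values differently; the function name is hypothetical.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalised betweenness centrality of an undirected, unweighted
    graph given as (a, b) pairs. Paths split evenly when several shortest
    paths exist between a pair of nodes."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    score = {node: 0.0 for node in neighbours}
    for source in neighbours:
        # Breadth-first search that counts shortest paths from source.
        order, predecessors = [], defaultdict(list)
        paths, distance = {source: 1}, {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for other in neighbours[node]:
                if other not in distance:
                    distance[other] = distance[node] + 1
                    paths[other] = 0
                    queue.append(other)
                if distance[other] == distance[node] + 1:
                    paths[other] += paths[node]
                    predecessors[other].append(node)
        # Accumulate path dependencies from the farthest nodes inwards.
        dependency = defaultdict(float)
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Every unordered pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}
```

In a chain A, B, C only the middle node B lies between the other two, so it receives the value 1 and the end nodes receive 0. Nodes with high values act as intermediaries between otherwise separate parts of the network.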

 

 

 

 

 

Therefore, these actors are also strongly conspicuous. To determine a hierarchy among the persons named above, we chose a status layout based on betweenness. The node size reflects the corresponding value (Figure 5). This is further evidence for the central role of Russia as a country and of Saleh Ahmed and Nicolai Kuryakin as players in the arms dealing network.

 

 

                                                                                                                 

Fig. 5.  Hierarchical ordering of the arms deal network.

 

Future activities can also be analyzed with our application, which provides a means to visualize the appearance of countries over time. We used an existing Java library for this implementation (Figure 2). The result of this extension is that Russia and Yemen had a constant number of occurrences in the arms dealing network and thus ought to be kept under close observation.

References

1.  U. Brandes and D. Wagner. Visone, 2009. [Online; accessed 28-May-2010].

2.  U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and T. Widener. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39:27–34, 1996.

3.  I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, 1999.

4.  J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling, 2005. [Online; accessed 15-May-2010].

5.  F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. ManyEyes: a site for visualization at internet scale. IEEE Trans Vis Comput Graph, 13(6):1121–1128, 2007.

 

 

Conclusions of both MC1 and MC2:


 

Mini Challenge 1 shows that most likely Nicolai will deal with Saleh Ahmed, and that these deals will be connected to Russia and Yemen. The analysis makes clear that Russia, Ukraine and Yemen are the main settings in the arms dealing network. The United Arab Emirates is a secondary setting that carries criminal risk. Using the time series in our tool, we can predict possible activities in these countries.

The second Mini Challenge helps us to support these assumptions and even expand some of them. Saleh Ahmed and Nicolai are the main actors in the network and act as transmitters (maximum betweenness value) for all criminal activities. Building up a hierarchy of the most heavily weighted nodes further supports the hypotheses. Based on this evidence, arresting these two persons would be an important preventive step for the security agencies. The other actors also pose a danger, but based on our findings we think that they are lost without the main protagonists.